58 ◾ Bioinformatics
of all suffixes of a given sequence. There is a variety of algorithms used for constructing
the suffix array implemented by software packages [7]. In its simplest form, a suffix array
is constructed for a sequence or a string. For the string (sequence) S with a length N and
the characters (bases) indexed as 1, 2, 3, …, N, we can construct an array for all suffixes or
substrings of the sequence as S[1…N], S[2...N],…, S[N...N] and then sort the suffixes lexi-
cographically. For instance, for the sequence S = “CTTGGCTGGA$”, we can construct the
arrays of suffixes and sort them as shown in Figure 2.6.
Figure 2.6 shows how the suffix array is constructed from sorted suffixes. The numbers
are the positions in the sequence. The sorting of suffixes lets suffixes beginning with the
same string of characters to appear one after the other and that allows a fast lookup when
we try to find exact matches of a read (substring). For instance, the exact match for the
reads “TGG” can be found by jumping to the sorted suffixes that begin with “TGG” and
we can fast locate the positions 7 and 3 (the coordinate begins from 1 and not 0 as in suf-
fix tree). When a reference genome is indexed using the suffix array, finding a position of
a pattern or a read will have a linear time complexity. A major drawback of using suffix
arrays is that they require a large memory storage depending on the size of the genome
being used. STAR [8] is as an example of the aligners that use suffix arrays.
2.2.4 Burrows–Wheeler Transform
The Burrows–Wheeler Transform or shortly BWT is a data structure algorithm that trans-
forms a string (sequence) into a compressible form that allows fast searching. The BWT is
used by BWA [9], Bowtie [10], and other aligners. BWT algorithm has been used by Unix
and Linux for data compression as bzip2 compression utility.
Let S be a sequence of characters of a sequence or string of a length n. For a genomic
sequence, a sequence is made up of A, C, G, T, or N for an ambiguous base. Let $ also be a spe-
cial single character as a trailing empty end of the sequence S (e.g., S = “CTTGGCTGGA$”).
The Burrows–Wheeler transform of the sequence S or bwt(S) is computed by generating
cyclic rotations of the sequences, as shown in the first column of Figure 2.7. The cyclic
FIGURE 2.6 Suffix array.